Finding the best wine for your budget Walking into the liquor store, the wine section can be a daunting experience. As with any shopping experience you want to know that you are getting the best thing for your budget. If you like what you got, you probably would want to try something similar as well. There are many varieties of wine and the prices range from a few dollars to a few thousands. What if there was a way to make wine purchasing more efficient? This tutorial is broken down into two parts. The first section handles finding the best wine for a specific price range for the variety of wine of your choosing. The second section handles finding the most similar wine to any given wine based on variety, points, and price. This dataset consists of 130 thousand listings of wines from around the world and are rated by professional wine tasters. The prices range from $4 t0 $3300 per bottle.
We believe that this is a good example of how data science can be applied to optimize everyday tasks. Not only saving us money but also giving our pallets something to enjoy. The research done in this tutorial can also be applied to far broader topics and things beyond just wine. The second half of our tutorial is used to find the most similar item to any given search items. Combined this tutorial gives us the tools that are useful for a multitude of online retailers and subscription-based services.
import pandas as pd
import matplotlib.pylab as plt
import matplotlib.axes as ax
import seaborn as sns
import numpy as np
wine_table = pd.read_csv('/winemag-data_first150k.csv')
#Provide the ability to show the entirety of the data when not looking at the head
pd.set_option("max_columns", None)
pd.set_option("max_rows", None)
wine_table.head()
# Finding unique varieties of wine in data
print('Variety:')
varieties = list(set(wine_table['variety'].to_list()))
wine_table['variety'].sort_values().unique()
# Displaying a sample of varieties
print(varieties[0:50])
# We will be using the index number from this set to act as the unique number identifier for each variety
print(varieties.index('Malbec'))
Here in the data frame below, one can see two Malbec wines with very similar points and prices, our goal is to be able to find the most similar wine to our search wine as possible.
First we need to create an (x,y,z) coordinate to represent the (variety,price,points)
# Given a specific wine, we want to be able to predict the wine closest or most similar to the specified wine based on variety, price and score
is_Malbec = wine_table['variety']=='Malbec'
malbec_table = wine_table[is_Malbec]
# Checking to see that filter worked
print('Malbec:')
print(malbec_table['variety'].sort_values().unique())
malbec_table.head()
# Creating points from wine rows to find euclidean distance
# Taking the example of an under 20 dollar bottle of Malbec (This corresponds to row 115 in the data set)
wine = wine_table.loc[115]
print(wine)
print("===================================")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("===================================")
variety_id = varieties.index(wine['variety'])
# intializing point for search wine
search = np.array(((varieties.index(wine['variety'])), (int(wine['price'])), (wine['points'])))
print("search element:")
print(search)
Here we use the Euclidean distance between our search wine and any other wine to see which one is the most similar. As you can see, from the two wines below, a wine of the same variety and similar price and points will have a much smaller "distance" from our search wine than that of a wine of different variety, and less similar price and points
A more basic example using simple integers can be found at: https://www.w3resource.com/python-exercises/math/python-math-exercise-79.php
# Calculating Euclidean distance for two random wines
elem1 = wine_table.loc[86]
print("first random wine: ")
print(elem1)
element1 = np.array(((varieties.index(elem1['variety'])), (int(elem1['price'])), (elem1['points'])))
print("")
print("first random wine point: ")
print(element1)
print("====================================================================================")
elem2 = wine_table.loc[190]
print("second random wine: ")
print(elem2)
element2 = np.array(((varieties.index(elem2['variety'])), (int(elem2['price'])), (elem2['points'])))
print("")
print("second random wine point: ")
print(element2)
print("====================================================================================")
print("")
print("search element:")
print(search)
print("====================================================================================")
# printing Euclidean distance
print("")
print("Difference between first random wine and our wine")
dist1 = np.linalg.norm(search - element1)
print(dist1)
print("")
print("Difference between second random wine and our wine")
dist2 = np.linalg.norm(search - element2)
print(dist2)
Here we see the difference between wines, using this we can find the wine with the smallest difference from our search wine, not including itself
# looping through data-frame to find most similar wine and recording the wine with the smallest difference
print("Search Element:")
print("")
print(wine['variety'])
print(varieties.index(wine['variety']))
print(int(wine['price']))
print(wine['points'])
print("")
print(search)
print("")
print("====================================================================================")
print("Total number of Wines:")
print(len(wine_table.index))
print("")
# For the sake of saving calculation time and space I will be running the distance algorithm on a filtered set of wines
# This set is the set of wines of the same variety
print("Total number of Malbec wines:")
print(len(malbec_table.index))
# Here we could have used the entire table but for the sake of the tutorial and calculation time we will use the filered table
def findSimilar():
min_row = 0
min_dist = 100.0
for x in malbec_table.dropna().index:
elem = wine_table.loc[x]
point = np.array(((varieties.index(elem['variety'])), (int(elem['price'])), (elem['points'])))
dist = np.linalg.norm(search - point)
if dist < min_dist and dist > 0:
min_dist = dist
min_row = x
return min_row
##
similar = findSimilar()
print("Wine most similar to our input row is:")
most_sim = wine_table.loc[similar]
print(wine_table.loc[similar])
print("Search Wine:")
print("")
print(wine['variety'])
print(int(wine['price']))
print(wine['points'])
print("")
print(wine)
print("______________________________")
print("Most Similar Wine:")
print("")
print(most_sim['variety'])
print(int(most_sim['price']))
print(most_sim['points'])
print("")
print(most_sim)
print("")
#Create custom cut intervals for easier use by a customer
cut_labels = ['0-20', '20-50', '50-100', '100-200','>500']
cut_bins = [0, 20, 50, 100, 200,500]
wine_table['custom_price_range'] = pd.cut(wine_table['price'].to_list(), bins=cut_bins, labels=cut_labels)
#Lists of intervals and varieties that can be looped without duplicates
intervals = wine_table['custom_price_range']
intervals = list(dict.fromkeys(intervals))
variety = wine_table['variety']
variety = list(dict.fromkeys(variety))
for curr in intervals:
#Query of a mini data table where all custom price ranges match the curr in the interval loop
qry = wine_table[wine_table['custom_price_range'] == curr]
#Loop through varieties within this interval
varieties = qry['variety']
varieties = list(dict.fromkeys(varieties))
for var in varieties:
#For each variety in this custom price range, gather the price and points
qry2 = qry[qry['variety'] == var]
pay = qry2['price'].to_list()
points = qry2['points'].to_list()
pay_sum = 0
points_sum = 0
for p in pay:
pay_sum += p
for p in points:
points_sum += p
x = []
#Take the average of the price and points for this variety at this price range
x.append(pay_sum/len(pay))
y = []
y.append(points_sum/len(points))
plt.title('Varieties\' Average Points per Price in the Price Range: ' + curr)
plt.xlabel('Average Price')
plt.ylabel('Average Points')
plt.scatter(x, y)
plt.figure(figsize=(10, 8))
plt.show()
#violin
points = wine_table['points'].to_list()
wine_table['points_range'] = pd.cut(points, 10)
varieties = wine_table['variety']
varieties = list(dict.fromkeys(varieties))
#Create graphs for each variety and their target price for points
for var in varieties:
qry = wine_table[wine_table['variety'] == var]
plt.figure(figsize=(10, 8))
sns.violinplot(x=qry['points_range'], y=qry['price'], data=qry).set_title(var)
plt.xticks(rotation=45)
plt.show()
#Find the best wine for $15
target = 15.0
range = ''
if 0.0 <= target < 20.0:
range = '0-20'
elif 20.0 <= target < 50.0:
range = '20-50'
elif 50.0 <= target < 100.0:
range = '50-100'
elif 100.0 <= target < 200.0:
range = '100-200'
elif 200.0 <= target < 500.0:
range = '200-500'
else:
range = '>500'
points = 0
qry = wine_table[wine_table['custom_price_range'] == range]
for index, row in qry.iterrows():
if row['points'] > points:
points = row['points']
best_value = row
print(best_value)
#Show our updated data table with added columns
wine_table.head()